Last time we translated a story into a statistical model. In general, a model looks like this:
\[y = \beta_0 + \beta_1 x + \epsilon\]
Because \(y = \beta_0 + \beta_1 x\) is a linear equation, adding \(\epsilon\) makes this a linear model.
But what kind of error?
If the error includes every other explanatory variable beyond the X that we’ve included in the model, what does the sum of all of their effects look like?
We have to assume \(\epsilon\) is distributed a certain way, typically:
\[\epsilon \sim Normal(0, 1)\]
The normal distribution, and every other probability distribution, has a few essential components:
The normal distribution is one of many probability distributions. Here’s the uniform distribution.
Okay, but why should we use a linear model and assume \(\epsilon \sim Normal(0, 1)\)?
A linear model with normal error and a continuous outcome \(y\) is known as a regression.
We now have outlined a complete model:
\[y = \beta_0 + \beta_1 x + \epsilon, \text{ where } \epsilon \sim Normal(0, 1)\]
For illustration, let’s assume that our model captures the full story, choose values for the parameters, and generate or simulate data using the model.
How do we simulate data from this?
\[y = \beta_0 + \beta_1 x + \epsilon, \text{ where } \epsilon \sim Normal(0, 1)\]
How do we simulate data from this?
\[y = \beta_0 + \beta_1 x + \epsilon, \text{ where } \epsilon \sim Normal(0, 1)\]
Whenever we are using randomization, for example, when simulating data or running a model that relies on randomization, we want to use set.seed() so we can get the same results every time we render.
# Random number are, well, random. rnorm(1)
## [1] -1.431754
rnorm(1)
## [1] -0.6508611
rnorm(1)
## [1] -1.824262
# But using set.seed() we at least start at the *same* random number each time. set.seed(42) rnorm(1)
## [1] 1.370958
set.seed(42) rnorm(1)
## [1] 1.370958
set.seed(42) rnorm(1)
## [1] 1.370958
# Load packages. library(tidyverse) # Set the randomization seed to simulate the same data each time. set.seed(42) # Set the parameter values. beta0 <- 3 beta1 <- 7 # Simulate data. sim_data <- tibble( x = runif(100, min = 0, max = 7), y = beta0 + beta1 * x + rnorm(100, mean = 0, sd = 3) )
sim_data |> ggplot(aes(x = x, y = y)) + geom_point()
Okay, what have we talked about so far?
Our goal is to use the model to estimate the unobserved parameters from the data (i.e., make our best guess).
In other words, an inferential model extracts parameter estimates from the data to inform our managerial decision.
The best line should be the one that makes the sum of the vertical bars as small as possible.
The vertical bars are called residuals, and represent the distance between the data \(y\) and a particular line.
Residuals can be positive and negative, so we make the sum of the squared residuals as small as possible.
The tidymodels framework is a collection of packages for modeling using tidyverse principles. The goal is to provide the same consistency and ease-of-use for modeling that the tidyverse provides for importing, wrangling, visualizing, and reporting data.
# Load packages. library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.3.0 ──
## ✔ broom 1.0.9 ✔ rsample 1.3.1 ## ✔ dials 1.4.1 ✔ tune 2.0.0 ## ✔ infer 1.0.9 ✔ workflows 1.3.0 ## ✔ modeldata 1.5.1 ✔ workflowsets 1.1.1 ## ✔ parsnip 1.3.3 ✔ yardstick 1.3.2 ## ✔ recipes 1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ── ## ✖ scales::discard() masks purrr::discard() ## ✖ dplyr::filter() masks stats::filter() ## ✖ recipes::fixed() masks stringr::fixed() ## ✖ dplyr::lag() masks stats::lag() ## ✖ yardstick::spec() masks readr::spec() ## ✖ recipes::step() masks stats::step()
Yes! But you’ll notice that the way we write code stays remarkably consistent.
In the tidymodels framework, we specify the model type and then set the engine we’d like to estimate the model with. The engine is typically another R package that would normally require its own unique syntax.
This allows us to use a variety of modeling packages while keeping the same syntax.
# Specify the model type and engine.
reg_model <- linear_reg() |>
set_engine("lm")